Serial Analysis of Gene Expression (SAGE) - Sequencing Errors
Abstract
Serial Analysis of Gene Expression (SAGE) is a technique for studying overall gene expression in different (normal or diseased) tissues. Results take the form of a so-called SAGE library for each tissue studied. A SAGE library is a set of text strings (typically 10 base pairs long) called tags. A tag represents a gene that is active in a particular cell or tissue. From a statistical point of view, a SAGE library is a series of counts corresponding to the number of occurrences of the different tags: the higher the count or, equivalently, the percentage of a particular tag in a library, the higher the activity of the gene represented by that tag. By comparing the percentage of a tag in libraries derived from different tissues, one can learn about possible differences in genetic activity. If an error is made during sequencing and one or more nucleotides are read incorrectly, the tag sequence becomes incorrect, which complicates estimation of the true, unobserved tag frequencies. Recent work studies the effect of sequencing errors on the statistical type I error in modeling SAGE data, using simulated SAGE data. Colinge and Feger [1] proposed the concept of a neighborhood, the idea being that highly abundant tags can influence the frequency of tags with a very similar sequence. Computer simulations were conducted under two settings. In the first, the tag of interest and its closest neighbors (tags differing in one nucleotide) were assumed to have the same expression levels. In the second, the tag of interest and its closest neighbors were assumed to have different expression levels. The model used in this work is the beta-binomial model introduced by Baggerly et al. [2][3].
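To make the neighborhood concept concrete, the following minimal Python sketch enumerates the closest neighbors of a tag, that is, all tags differing from it in exactly one nucleotide. It assumes 10-base tags over the alphabet A, C, G, T; the function name `one_substitution_neighbors` and the example tag are illustrative and are not taken from the cited work.

```python
NUCLEOTIDES = "ACGT"


def one_substitution_neighbors(tag):
    """Return all tags that differ from `tag` in exactly one position."""
    neighbors = []
    for position, original in enumerate(tag):
        for base in NUCLEOTIDES:
            if base != original:
                # Swap in one alternative base at this position only.
                neighbors.append(tag[:position] + base + tag[position + 1:])
    return neighbors


example_tag = "ACGTACGTAC"  # hypothetical 10-base tag, for illustration only
neighbors = one_substitution_neighbors(example_tag)
print(len(neighbors))  # 10 positions x 3 alternative bases = 30
```

Each of the 10 positions can be replaced by 3 other bases, so a 10-base tag has 30 such neighbors.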
The beta-binomial model correctly incorporates both sources of variation typical of SAGE data. The set of statistical tests was chosen according to the available literature and consists of tests suited to comparing overdispersed binary data. Conclusions: Disease or gene disorders are known to change the distribution of tag counts within a particular SAGE library. When comparing libraries, a situation may arise in which the distribution of the tags neighboring the candidate tag of interest has changed. If the distribution changes in such a way that the total proportion of the 30 neighboring tags changes, sequencing error can cause trouble, especially when the count of the tag of interest is low and/or the difference between the neighbor proportions in the biological settings being compared is large. This situation needs to be avoided, or the observed counts of the tags being analyzed have to be adjusted. Acknowledgement: Research partially supported by grant AV0Z10300504.
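The mechanism behind this conclusion can be illustrated with a small expected-value calculation in Python. This is a sketch under simplifying assumptions that are not stated in the abstract: each base is misread independently with the same probability, a misread base is equally likely to become any of the three other bases, and only the tag itself and its one-substitution neighbors contribute to the observed count. The function name and all numeric values are hypothetical.

```python
def expected_observed_proportion(p_tag, p_neighbors, error_rate, tag_length=10):
    """Expected observed share of a tag under an assumed uniform substitution-error model.

    p_tag:       true proportion of the tag of interest
    p_neighbors: true total proportion of its one-substitution neighbors
    error_rate:  assumed per-base probability of a sequencing error
    """
    # Probability that a tag is read with no error at any position.
    p_correct = (1.0 - error_rate) ** tag_length
    # Probability that a given neighbor is misread as the tag of interest:
    # the single differing position is misread into the right base (1 of 3),
    # and the remaining positions are read correctly.
    p_leak = (error_rate / 3.0) * (1.0 - error_rate) ** (tag_length - 1)
    return p_tag * p_correct + p_neighbors * p_leak


# Hypothetical values: the tag of interest has the same true proportion in
# both conditions, but its neighbors are far more abundant in condition B.
observed_a = expected_observed_proportion(p_tag=0.0001, p_neighbors=0.001, error_rate=0.01)
observed_b = expected_observed_proportion(p_tag=0.0001, p_neighbors=0.02, error_rate=0.01)
print(observed_a, observed_b)
```

Under these assumptions the expected observed proportion is larger in condition B even though the true proportion of the tag is identical in both conditions, which is the kind of spurious difference that can inflate the type I error of a test comparing the two libraries.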
Similar resources
Correction of sequence-based artifacts in serial analysis of gene expression
MOTIVATION Serial Analysis of Gene Expression (SAGE) is a powerful technology for measuring global gene expression, through rapid generation of large numbers of transcript tags. Beyond their intrinsic value in differential gene expression analysis, SAGE tag collections afford abundant information on the size and shape of the sample transcriptome and can accelerate novel gene discovery. These la...
Statistical modeling of sequencing errors in SAGE libraries.
MOTIVATION Sequencing errors may bias the gene expression measurements made by Serial Analysis of Gene Expression (SAGE). They may introduce non-existent tags at low abundance and decrease the real abundance of other tags. These effects are increased in the longer tags generated in LongSAGE libraries. Current sequencing technology generates quite accurate estimates of sequencing error rates. He...
Identification and prevention of a GC content bias in SAGE libraries.
Serial Analysis of Gene Expression (SAGE) is becoming a widely used gene expression profiling method for the study of development, cancer and other human diseases. Investigators using SAGE rely heavily on the quantitative aspect of this method for cataloging gene expression and comparing multiple SAGE libraries. We have developed additional computational and statistical tools to assess the qual...
Serial Analysis of Gene Expression: Applications in Human Studies
Serial analysis of gene expression (SAGE) is a powerful tool, which provides quantitative and comprehensive expression profile of genes in a given cell population. It works by isolating short fragments of genetic information from the expressed genes that are present in the cell being studied. These short sequences, called SAGE tags, are linked together for efficient sequencing. The frequency of...
An anatomy of normal and malignant gene expression.
A gene's expression pattern provides clues to its role in normal physiology and disease. To provide quantitative expression levels on a genome-wide scale, the Cancer Genome Anatomy Project (CGAP) uses serial analysis of gene expression (SAGE). Over 5 million transcript tags from more than 100 human cell types have been assembled. To enhance the utility of this data, the CGAP SAGE project create...
Publication date: 2005